Abstract

We present an overview of massively parallel deterministic algorithms which
combine high fault-tolerance and efficiency. This desirable combination (called
robustness here) is nontrivial, since increasing efficiency implies removing
redundancy whereas increasing fault-tolerance requires adding redundancy to
computations. We study a spectrum of algorithmic models for which significant
robustness is achievable, from static fault, synchronous computation to
dynamic fault, asynchronous computation. In addition to fail-stop processor
models, we examine and deal with arbitrarily initialized memory and restricted
memory access concurrency. We survey the deterministic upper bounds for the
basic Write-All primitive, the lower bounds on its efficiency, and we identify
some of the key open questions. We also generalize the robust computing of
functions to relations; this new approach can model approximate computations.
We show how to compute approximate Write-All optimally. Finally, we synthesize
the state-of-the-art in a complexity classification, which extends with
fault-tolerance the traditional classification of efficient parallel algorithms.

2.2.1  Introduction 

A  basic  problem  of  massively  parallel  computing  is  that  the  unreliability  of 
inexpensive  processors  and  their  interconnection  may  eliminate  any  potential 
efficiency  advantage  of  parallelism.  Our  research  is  an  investigation  of  fault 
models  and  parallel  computation  models  under  which  it  is  possible  to  achieve 
algorithmic  efficiency  (i.e.,  speed-ups  close  to  linear  in  the  number  of  processors) 
despite the presence of faults. We would like to note that these models can also
be used to explore common properties of a broad spectrum of fault-free models,
from synchronous parallel to asynchronous distributed computing. Here, our
presentation focuses on deterministic algorithms and complexity, as opposed to
algorithms that use randomization.

There is an intuitive trade-off between reliability and efficiency because
reliability usually requires introducing redundancy in the computation in order
to detect errors and reassign resources, whereas gaining efficiency by massively
parallel computing requires removing redundancy from the computation to fully
utilize each processor. Thus, even allowing for some abstraction in the model of
parallel computation, it is not obvious that there are any non-trivial fault
models that allow near-linear speed-ups. So it was somewhat surprising when in [17]
we demonstrated that it is possible to combine efficiency and fault-tolerance for
many basic algorithms expressed as concurrent-read concurrent-write (CRCW)
parallel random access machines (PRAMs [14]).

The [17] fault model allows any pattern of dynamic fail-stop no restart
processor errors, as long as one processor remains alive. The fault model was
applied to all CRCW PRAMs in [23, 40]. It was extended in [18] to include
processor restarts, in [42] to include arbitrary static memory faults, i.e.,
arbitrary memory initialization, and in [16] to include restricted memory access
patterns through controlled memory access. Concurrency of reads and writes
is an essential feature that accounts for the necessary redundancy, so it can be
restricted but not eliminated; see [16, 17] for an in-depth discussion of this
issue. Also, as shown in [17], it suffices to consider COMMON CRCW PRAMs
(all concurrent writes are identical) in which the atomically written words need
only contain a constant number of bits.

The  work  we  survey  makes  three  key  assumptions.  Namely  that: 

1.  Failure-inducing  adversaries  are  worst-case  for  each  model  and  algorithms 
for  coping  with  them  are  deterministic. 

2.  Processors  can  read  and  write  memory  concurrently  -  except  that  initial 
faults  can  be  handled  without  memory  access  concurrency. 

3.  Processor  faults  do  not  affect  memory  -  except  that  initial  memory  can 
be  contaminated. 

A central algorithmic primitive in our work is the Write-All operation [17].
Iterated Write-All forms the basis for the algorithm simulation techniques
of [23, 40] and for the memory initialization of [42]. Therefore, improved
Write-All solutions lead to improved simulations and memory clearing techniques.

The Write-All problem is: using P processors write 1s into all locations of
an array of size N, where P ≤ N. When P = N this operation captures the
computational progress that can be naturally accomplished in one time unit
by a PRAM. We say that Write-All completes at the global clock tick at which
all the processors that have not fail-stopped share the knowledge that 1s have
been written into all N array locations. Requiring completion of a Write-All
algorithm is critical if one wishes to iterate it, as pointed out in [23], which uses
a certification bit to separate the various iterations of (Certified) Write-All.
Note that Write-All completes when all processors halt in all algorithms
presented here.
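
To make the problem statement concrete, the following minimal Python sketch simulates a naive, non-robust Write-All in which processor PID writes cells PID, PID + P, PID + 2P, and so on. It is an illustration only and is not one of the surveyed algorithms; the function name and the `failed` parameter are hypothetical. It shows why a static assignment of cells to processors cannot tolerate even a single initial fault.

```python
def naive_write_all(n, p, failed=frozenset()):
    """Static striping: processor pid writes cells pid, pid+p, pid+2p, ...

    `failed` is a hypothetical set of pids that fail-stop before doing any
    work. Returns the array and the number of completed processor steps.
    """
    x = [0] * n
    steps = 0
    for pid in range(p):
        if pid in failed:
            continue  # a fail-stopped processor writes nothing
        for cell in range(pid, n, p):
            x[cell] = 1
            steps += 1
    return x, steps


# Fault-free run: every cell is written with N processor steps.
full, work = naive_write_all(8, 4)
# One failed processor leaves its whole stripe unwritten.
partial, _ = naive_write_all(8, 4, failed={1})
```

With no faults the array is filled using exactly N steps, but once processor 1 fails, cells 1 and 5 are never written and no surviving processor learns of it. The robust algorithms surveyed below reassign such work dynamically.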

Under dynamic failures, efficient deterministic solutions to Write-All, i.e.,
solutions increasing the fault-free $O(N)$ work by small polylog(N) factors, are
non-obvious. The first such solution was algorithm W of [17], which has (to date)
the best worst-case work bound $O(N + P \log^2 N / \log\log N)$ for 1 ≤ P ≤ N.
This bound was first shown in [22] for a different version of the algorithm and
in [29] the basic argument was adapted to algorithm W.

Let  us  now  describe  the  contents  of  this  survey,  with  some  pointers  to  the 
literature,  as  well  as  our  new  contributions.  In  Section  2.2.2  we  present  a 
synthesis  of  parallel  computation  and  fault  models.  This  synthesis  is  new  and 
includes most of the models proposed to date. It links the work on fail-stop
no-restart errors to fail-stop errors with restarts (both detectable and
undetectable restarts).

The detectable restart case has been examined, using a slightly different
formalism, in [8, 18]. The undetectable restart case is equivalent to the most
general model of asynchrony that has received a fair amount of attention
in the literature. An elegant deterministic solution for Write-All in this case
appeared in [3]. The proof in [3] is existential, because it uses a counting
argument. It has recently been made constructive in [33].

For some important early work on asynchronous PRAMs we refer to [9, 10, 15,
22, 23, 30, 32, 34]. In the last three years, randomized asynchronous
computation has been examined in depth in [4, 5, 21]. These analyses involve randomness
in a central way. They are mostly about off-line or oblivious adversaries, which
cause faults during the computation but pick the times of these faults before
the computation. Although we will not survey this interesting subject here, we
would like to point out that one very promising direction involves combining
techniques of randomized asynchronous computation with randomized
information dispersal [36]. The work on fault-tolerant and efficient parallel shared
memory models has also been applied to distributed message passing models;
for example see [1, 11, 12].

In Section 2.2.3 we examine an array of algorithms for the Write-All
problem. These employ a variety of deterministic techniques and are extensible to
the computation of other functions (see Section 2.2.4). In particular, in
Section 2.2.4, we provide new bounds for fault-tolerant and efficient computation
of parallel prefixes. In Section 2.2.5 we introduce the problem of approximate
Write-All by computing relations instead of functions. One new contribution
that we make is to solve approximate Write-All optimally. In Section 2.2.6 we
survey the state-of-the-art in lower bounds. In Section 2.2.7 we present a new
complexity classification for fault-tolerant algorithms. We close with a
discussion of randomized vs deterministic techniques for fault-tolerant and efficient
parallel computation (see Section 2.2.8).

2.2.2  Fault-tolerant  parallel  computation  models 

In the first subsection we detail a hierarchy of fail-stop models of parallel
computation. We then explain the cost measures of available processor steps and
overhead ratio, which we use to characterize robust algorithms. The final three
subsections contain comments on variations of the processor, memory, and
network interconnect parts of our models.

2.2.2.1  Fail-Stop  PRAMs 

The parallel random access machine (PRAM) of Fortune and Wyllie [14]
combines the simplicity of a RAM with the power of parallelism, and a wealth of
efficient algorithms exist for it; see surveys [13, 20] for the rationale behind this
model and the fundamental algorithms. We build our models of fail-stop PRAMs
as extensions of the PRAM model.

1. There are Q shared memory cells, and the input of size N ≤ Q is stored in
the first N cells. Except for the cells holding the input, all other memory
is cleared, i.e., contains zeroes. Each memory cell can store Θ(log N) bits.
All processors can access shared memory. For convenience we assume they
“know” the input size N, i.e., the log N bits describing it can be part of
their finite state control. For convenience we assume that each processor
also has a constant size private memory, that only it can access.

2. There are P ≤ N initial processors with unique identifiers (PIDs) in the
range 1, ..., P. Each processor “knows” its PID and the value of P, i.e.,
these can be part of its finite state control.

3. The processors that are active all execute synchronously as in the
standard PRAM model [14]. Although processors proceed in synchrony and
an observer outside the PRAM can associate a “global time” with every
event, the processors do not have access to “global time”, i.e., processors
can try to keep local clocks by counting their steps and communicating
through shared memory but the PRAM does not provide a “global clock”.

4. Processors stop without affecting memory. They may also restart,
depending on the power of a failure-inducing adversary.

In the study of fail-stop PRAMs, we consider four main types of
failure-inducing adversaries. These form a hierarchy, based on their power. Note that
each adversary is more powerful than the preceding ones and that the last case
can be used to simulate fully asynchronous processors [3].

Initial faults: adversary causes processor failures only prior to the start of the
computation.

Fail-stop  failures:  adversary  causes  stop  failures  of  the  processors  during  the 
computation;  there  are  no  restarts. 

Fail-stop failures, detectable restarts: adversary causes stop failures;
subsequently to a failure, the adversary might restart a processor and a
restarted processor “knows” of the restart.

Fail-stop  failures,  undetectable  restarts:  adversary  causes  stop  failures  and 
restarts;  a  restarted  processor  does  not  necessarily  “know”  of  the  restart. 

Except  for  the  initial  failures  case,  the  adversaries  are  dynamic.  A  major 
characteristic  of  these  adversary  models  is  that  they  are  worst-case.  These  have 
full  information  about  the  structure  and  the  dynamic  behavior  of  the  algorithms 
whose  execution  they  interfere  with,  while  being  completely  unknown  to  the 
algorithms. 

Remark on (un)detectable restarts: One way of realizing detectable restarts
is by modifying the finite state control of the PRAM. Each instruction can have
two parts, a green and a red part. The green part gets executed under normal
conditions. If a processor fails then all memory remains intact, but in the
subsequent restart the red part of the next instruction is executed instead of the
green part. For example, the model used in [8, 18] can be realized this way, instead of
using “update cycles”. The undetectable restarts adversary can also be realized
in a similar way by making the algorithm weaker. For undetectable restarts,
algorithms have to have identical red and green parts. For example, the fully
asynchronous model of [3] can be realized this way. □

Figure 2.2.1: An architecture for a fail-stop multiprocessor.

We formalize failures as follows. A failure pattern F is syntactically defined
as a set of triples <tag, PID, t> where tag is either failure, indicating processor
failure, or restart, indicating a processor restart, PID is the processor identifier,
and t is the time indicating when the processor stops or restarts. This time
is a global time, that could be assigned by an observer (or adversary) outside
the machine. The size of the failure pattern F is defined as the cardinality |F|,
where |F| ≤ M for some parameter M.
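
The failure pattern formalism can be sketched as a small data structure. The Python fragment below is only an illustration under stated assumptions: the names `Event` and `alive_at` are hypothetical, every processor is assumed to start alive, and an event stamped with time t is assumed to take effect at time t.

```python
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    """One triple <tag, PID, t> of a failure pattern."""
    tag: str  # "failure" or "restart"
    pid: int  # processor identifier
    t: int    # global time, as seen by an outside observer


def alive_at(pattern: Iterable[Event], pid: int, t: int) -> bool:
    """Return True if processor `pid` is running at global time `t`.

    Assumptions of this sketch: processors start alive, and the most
    recent event with time <= t decides the current state.
    """
    alive = True
    events = sorted((e for e in pattern if e.pid == pid and e.t <= t),
                    key=lambda e: e.t)
    for event in events:
        alive = event.tag == "restart"
    return alive


# A pattern F of size |F| = 3: processor 1 stops at time 3 and restarts at
# time 5, processor 2 stops at time 4 and never restarts.
F = {Event("failure", 1, 3), Event("restart", 1, 5), Event("failure", 2, 4)}
pattern_size = len(F)
```

Representing F as a set makes its size |F| simply the cardinality of the set, matching the definition in the text.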

The abstract model that we are studying can be realized in the architecture
in Fig. 2.2.1. This architecture is more abstract than, e.g., an implementation
in terms of hypercubes, but it is simpler to program in. Moreover, various
fault-tolerant technologies can contribute towards concrete realizations of its
components. There are P fail-stop processors [38]. There are Q shared memory cells.
These semiconductor memories can be manufactured with built-in fault
tolerance using replication and coding techniques [37]. Processors and memory are
interconnected via a synchronous network [39]. A combining interconnection
network well suited for implementing synchronous concurrent reads and writes
is in [24] and can be made more reliable by employing redundancy [2]. In this
architecture, when the underlying hardware components are subject to failures
within their design parameters, the algorithms we develop work correctly, and
within the specified complexity bounds.

2.2.2.2  Measures  of  Efficiency 

We use a generalization of the standard Parallel-time × Processors product to
measure work of an algorithm when the number of processors performing work
fluctuates due to failures or delays [17, 18]. In the measure we account for the
available processor steps and we do not charge for time steps during which a
processor was unavailable due to a failure.

Definition 2.2.1 Consider a parallel computation with P initial processors
that terminates in time τ after completing its task on some input data I of
size N and in the presence of the fail-stop error pattern F. If $P_i(I, F) \le P$ is
the number of processors completing an instruction at step i, then we define
S(I, F, P) as: $S(I, F, P) = \sum_{i=1}^{\tau} P_i(I, F)$. □


Definition 2.2.2 A P-processor PRAM algorithm on any input data I of size
|I| = N and in the presence of any pattern F of failures of size |F| ≤ M uses
available processor steps $S = S_{N,M,P} = \max_{I,F} \{ S(I, F, P) \}$. □
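
As a small illustration of the available processor steps measure, the sketch below sums, over all time steps, the number of processors that completed an instruction, and then takes the maximum over a set of executions. The trace representation (a list with one set of PIDs per step) and both function names are assumptions made for this example only.

```python
def available_steps(trace):
    """Sum over steps i of the number of processors completing step i.

    `trace[i]` is the (hypothetical) set of pids that completed an
    instruction at step i + 1; failed processors are simply absent.
    """
    return sum(len(active) for active in trace)


def worst_case_steps(traces):
    """Maximum of available_steps over a collection of executions."""
    return max(available_steps(trace) for trace in traces)


# Four processors; one fails after each step, so 4 + 3 + 2 + 1 steps are charged.
shrinking = [{1, 2, 3, 4}, {1, 2, 4}, {1, 4}, {4}]
# Fault-free execution of two steps with four processors.
fault_free = [{1, 2, 3, 4}, {1, 2, 3, 4}]
```

Note that the time steps during which a processor was unavailable contribute nothing to the sum, as required by the measure.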

The available steps measure S is used in turn to define the notion of
algorithm robustness that combines fault tolerance and efficiency:

Definition 2.2.3 Let T(N) be the best sequential (RAM) time bound known
for N-size instances of a problem. We say that a parallel algorithm for this
problem is a robust parallel algorithm if: for any input I of size N and for any
number of initial processors P (1 ≤ P ≤ N) and for any failure pattern F
of size at most M with at least one surviving processor (M < N for fail-stop
model), the algorithm completes its task with $S = S_{N,M,P} \le c \, T(N) \log^{c'} N$,
for fixed c, c'. □

For arbitrary failures and restarts, the completed work measure S depends
on the size N of the input I, the number of processors P, and the size of the
failure pattern F. The ultimate performance goal is to perform the required
computation  at  a  work  cost  as  close  as  possible  to  the  work  performed  by  the 
best  sequential  algorithm  known.  Unfortunately,  this  goal  is  not  attainable 
when  an  adversary  succeeds  in  causing  too  many  processor  failures  during  a 
computation. 

Example: Consider a Write-All solution, where it takes a processor one
instruction to recover from a failure. If an adversary has a failure pattern F with
$|F| = \Omega(N^{1+\epsilon})$ for ε > 0, then work will be $\Omega(N^{1+\epsilon})$ regardless of how efficient
the algorithm is otherwise.

This illustrates the need for a measure of efficiency that is sensitive to both
the size of the input N, and the size of the failure pattern |F| ≤ M. We thus
also introduce the overhead ratio σ that amortizes work over the essential work
and failures:

Definition 2.2.4 A P-processor PRAM algorithm on any input data I of size
|I| = N and in the presence of any pattern F of failures and restarts of size
|F| ≤ M has overhead ratio $\sigma = \sigma_{N,M,P} = \max_{I,F} \left\{ \frac{S(I, F, P)}{|I| + |F|} \right\}$. □

When M = O(P) as in the case of the stop failures without restarts, S
properly describes the algorithm efficiency, and $\sigma = O(S_{N,M,P} / N)$. When F can
be large relative to N and P with restarts enabled, σ better reflects the efficiency
of fault-tolerant algorithms. We can generalize the definition of σ in Def. 2.2.4
in terms of the ratio $S(I, F, P) / (T(I) + |F|)$, where T(I) is the time complexity of the best
known sequential solution for a particular problem.
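
A short numeric sketch may help to see why the overhead ratio is the more informative measure under heavy restarts. It assumes the overhead ratio of a single execution is the available processor steps divided by the sum of the input size and the failure pattern size; the function name and the sample numbers are hypothetical.

```python
def overhead_ratio(steps: int, input_size: int, pattern_size: int) -> float:
    """Available processor steps amortized over input size plus pattern size."""
    return steps / (input_size + pattern_size)


# Hypothetical execution: N = 1000 cells, an adversary causing 1,000,000
# failure/restart events, one recovery instruction per event plus N writes.
N = 1_000
F_SIZE = 1_000_000
S = N + F_SIZE

ratio_vs_input = S / N                        # grows with the pattern size
ratio_amortized = overhead_ratio(S, N, F_SIZE)  # stays constant here
```

Measured against N alone, this execution looks a thousand times more expensive than the fault-free one, even though the algorithm wastes no work beyond the recoveries forced by the adversary. The amortized ratio reflects this.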


2.2.2.3 Processor issues: survivability

We have chosen to consider only the failure models where the processors do not
write  any  erroneous  or  maliciously  incorrect  values  to  shared  memory.  While 
malicious  processor  behavior  is  often  considered  in  conjunction  with  message 
passing  systems,  it  makes  less  sense  to  consider  malicious  behavior  in  tightly 
coupled  shared  memory  systems.  This  is  because  even  a  single  faulty  processor 
has  the  potential  of  invalidating  the  results  of  a  computation  in  unit  time,  and 
because  in  a  parallel  system  all  processors  are  normally  “trusted”  agents,  and 
so  the  issues  of  security  are  not  applicable. 

The  fail-stop  model  with  undetectable  restarts  and  dynamic  adversaries  is 
the  most  general  fault  model  we  deal  with.  It  can  be  viewed  as  a  model  of 
parallel  computation  with  arbitrary  asynchrony. 

Remark  on  stronger  survivability  assumption:  The  default  assumption 
we  make  is  that  throughout  the  computation  one  processor  is  fault-free.  This 
assumption  can  be  made  stronger,  i.e.,  a  constant  fraction  of  the  processors  are 
fault-free.  We  always  list  the  stronger  assumption  explicitly  when  used  (e.g.,  in 
the complexity classification). □

Remark on weaker survivability assumption and restarts: For the
models with restarts one can use the weaker survivability assumption that at each
global clock tick one processor step executes. In [18] this was stated using
“update cycles”, but it can be stated using our green-red instruction
implementation (see the remark on (un)detectable restarts). □


2.2.2.4 Memory issues: words vs bits and initialization

In our models we assume that log N-bit word parallel writes are performed
atomically  in  unit  time.  The  algorithms  in  such  models  can  be  modified  so  that 
this  restriction  is  relaxed. 

The sufficient definition of atomicity is: (1) log N-size words are written
using log N bit write cycles, and (2) the adversary can cause arbitrary fail-stop
errors either before or after the single bit write cycle of the PRAM, but not
during the bit write cycle.

The  algorithms  that  assume  word  atomicity  can  be  mechanically  compiled 
into  algorithms  that  assume  only  the  bit  atomicity  as  stated  above. 

A much more important assumption in many Write-All solutions was the
initial state of additional auxiliary memory used (typically of Ω(P) size). The
basic assumption has been that: The Ω(P) auxiliary shared memory is cleared
or initialized to some known value.


While this is consistent with definitions of PRAM such as [14], it is
nevertheless a requirement that fault-tolerant systems ought to be able to do
without. Interestingly there is an efficient deterministic procedure that solves the
Write-All problem even when the shared memory is contaminated, i.e., contains
arbitrary values.

2.2.2.5 Interconnect issues: concurrency vs redundancy

The choice of CRCW (concurrent read, concurrent write) model used here is
justified because of a lower bound [17] that shows that the CREW (concurrent
read, exclusive write) model does not admit fault-tolerant efficient algorithms.
However we still would like to control memory access concurrency. We define
measures that gauge the concurrent memory accesses of a computation.

Definition 2.2.5 Consider a parallel computation with P initial processors
that terminates in time τ after completing its task on some input data I of size
N in the presence of fail-stop error pattern F. If at time i (1 ≤ i ≤ τ), $P_i^R$
processors perform reads from $N_i^R$ shared memory locations and $P_i^W$ processors
perform writes to $N_i^W$ locations, then we define:

(i) the read concurrency ρ as: $\rho = \rho_{I,F,P} = \sum_{i=1}^{\tau} (P_i^R - N_i^R)$, and

(ii) the write concurrency ω as: $\omega = \omega_{I,F,P} = \sum_{i=1}^{\tau} (P_i^W - N_i^W)$. □

For a single read from (write to) a particular memory location, the read
(write) concurrency ρ (ω) for that location is simply the number of readers
(writers) minus one. For example, if only one processor reads from (writes to)
a location, then ρ (ω) is 0, i.e., no concurrency is involved. Also note that the
concurrency measures ρ and ω are cumulative over a computation.

For the algorithms in the EREW model, ρ = ω = 0, while for the CREW
model, ω = 0. Thus our measures capture one of the key distinctions among
the EREW, CREW and CRCW memory access disciplines.
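
The concurrency measures can be computed mechanically from an access log. The sketch below assumes a hypothetical log format, one list of (pid, address) pairs per time step, and returns the sum over steps of the number of accessing processors minus the number of distinct locations accessed. The same function serves for reads and for writes.

```python
def concurrency(accesses_per_step):
    """Cumulative concurrency: sum over steps of (#processors - #locations).

    `accesses_per_step[i]` is a (hypothetical) list of (pid, address) pairs
    for the accesses of one kind, reads or writes, performed at step i + 1.
    """
    total = 0
    for accesses in accesses_per_step:
        processors = len(accesses)
        locations = len({address for _, address in accesses})
        total += processors - locations
    return total


# Step 1: three processors read cell 0 (two redundant readers).
# Step 2: one processor reads cell 5 (no concurrency).
reads = [[(1, 0), (2, 0), (3, 0)], [(1, 5)]]
# Writes are exclusive in both steps, as in a CREW computation.
writes = [[(1, 7), (2, 8)], [(3, 9)]]

rho = concurrency(reads)
omega = concurrency(writes)
```

In this log the read concurrency is 2 and the write concurrency is 0, so the computation respects the CREW discipline but not the EREW discipline.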

2.2.3  Robust  parallel  assignment  and  Write-All 

2.2.3.1 Write-All and initial faults

We first consider the weak model of initial (static) faults in which failures can
only occur prior to the start of an algorithm. We assume that the size of the
Write-All instances is N and that we have P processors, P' ≤ P of which are
alive at the beginning of the algorithm. Our EREW algorithm E (Fig. 2.2.2)
consists of phases E1 and E2. In phase E1, processors enumerate themselves and
compute the total number of live processors. The details of this non-oblivious
counting are in [16]. In phase E2, the processors partition the input array so
that each processor is responsible for setting to 1 all the entries in its partition.

forall processors PID=1..P parbegin
  Phase E1: Use non-oblivious parallel prefix to compute rank_PID and P'
  Phase E2: Set x[(rank_PID - 1) · N/P' .. (rank_PID · N/P') - 1] to 1
parend

Figure 2.2.2: A high level view of algorithm E.

Theorem 2.2.1 The Write-All problem with initial processor and memory
faults can be solved in place with S = O(N + P' log P) on an EREW PRAM,
where 1 ≤ P ≤ N and P - P' is the number of initial faults.

With  the  result  of  [7]  it  can  be  shown  that  this  algorithm  is  optimal,  without 
memory  access  concurrency. 
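
The two phases of algorithm E can be illustrated by a sequential simulation. The sketch below is not the EREW implementation of [16]: the ranking of phase E1 is computed with an ordinary prefix sum instead of the non-oblivious parallel prefix, the array is indexed from 0, and floor division is used so that N need not be a multiple of P'. The function name and the `alive` parameter are hypothetical.

```python
from itertools import accumulate


def algorithm_e(n, p, alive):
    """Simulate phases E1 and E2 for P processors with initial faults.

    `alive[pid]` tells whether processor pid survived the initial faults.
    Returns the written array and the number P' of live processors.
    """
    if len(alive) != p:
        raise ValueError("alive must have one entry per processor")
    # Phase E1: rank the live processors (inclusive prefix sums) and count them.
    ranks = list(accumulate(1 if a else 0 for a in alive))
    p_live = ranks[-1]
    if p_live == 0:
        raise ValueError("at least one processor must be alive")
    # Phase E2: the processor of rank r writes its own contiguous block.
    x = [0] * n
    for pid in range(p):
        if not alive[pid]:
            continue
        r = ranks[pid]                    # rank in 1..P'
        lo = (r - 1) * n // p_live
        hi = r * n // p_live
        for cell in range(lo, hi):
            x[cell] = 1
    return x, p_live


# Ten cells, four processors, processor 1 is faulty from the start.
x, p_live = algorithm_e(10, 4, [True, False, True, True])
```

Because consecutive ranks receive consecutive blocks, the blocks are disjoint and cover the whole array, so no two processors ever write the same cell in phase E2.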

2.2.3.2  Dynamic  faults  and  algorithm  W 

A more sophisticated approach is necessary to obtain an efficient parallel
algorithm when the failures are dynamically determined by an on-line adversary.
Algorithm  W  of  [17]  is  an  efficient  fail-stop  Write-All  solution  (Fig.  2.2.3).  It 
uses  full  binary  trees  for  processor  counting,  processor  allocation,  and  progress 
measurement.  Active  processors  synchronously  iterate  through  the  following 
four  phases: 

W1: Processor enumeration. All the processors traverse bottom-up the
processor enumeration tree. A version of parallel prefix algorithm is used
resulting in an overestimate of the number of live processors.

W2: Processor allocation. All the processors traverse the progress
measurement tree top-down using a divide-and-conquer approach based on
processor enumeration and are allocated to un-written input cells.

W3:  Work  phase.  Processors  work  at  the  leaves  reached  in  phase  W2. 

W4:  Progress  measurement.  All  the  processors  traverse  bottom-up  the  progress 
tree  using  a  version  of  parallel  prefix  and  compute  an  underestimate  of 
the  progress  of  the  algorithm. 

Algorithm W achieves optimality when parameterized using a progress tree
with N/log N leaves and log N input data elements associated with each of its leaves.
By optimality we mean that for a range of processors the work is O(N). A
complete description of the algorithm can be found in [17]. Martel [29] gave a
tight analysis of algorithm W.

forall processors PID=1..N parbegin
  Phase W3: Visit leaves based on PID to work on the input data
  Phase W4: Traverse the progress tree bottom up to measure progress
  while the root of the progress tree is not N do
    Phase W1: Traverse counting tree bottom up to enumerate processors
    Phase W2: Traverse the progress tree top down to reschedule work
    Phase W3: Perform rescheduled work on the input data
    Phase W4: Traverse the progress tree bottom up to measure progress
  od
parend

Figure 2.2.3: A high level view of algorithm W.

Theorem 2.2.2 [17, 29] Algorithm W is a robust parallel Write-All algorithm
with $S = O(N + P \log^2 N / \log\log N)$, where N is the input array size and the
initial number of processors P is between 1 and N.


Note  that  the  above  bound  is  tight  for  algorithm  W.  This  upper  bound  was 
first  shown  in  [22]  for  a  different  algorithm.  The  data  structuring  technique  [22] 
might lead to even better bounds for Write-All.
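
The interplay of the four phases can be sketched with a strongly simplified sequential simulation. The fragment below is an illustration under explicit assumptions and is not algorithm W as specified in [17]: failures are allowed only between iterations (so the enumeration and progress counts are exact rather than over- and underestimates), every leaf holds a single array element, N is a power of two, and all names are hypothetical.

```python
def simulate_w(n, p, failures_by_iteration):
    """Simplified iteration of phases W1-W4 on a heap-ordered progress tree.

    `failures_by_iteration[k]` is the set of pids that fail-stop just before
    iteration k. Returns the written array and the number of iterations.
    """
    x = [0] * n
    done = [0] * (2 * n)           # done[v] = written leaves below node v
    live = set(range(p))
    iterations = 0
    while done[1] < n:
        live -= failures_by_iteration.get(iterations, set())
        if not live:
            raise RuntimeError("the model requires one surviving processor")
        # W1: enumerate the live processors (ranks 0..k-1).
        k = len(live)
        # W2: top-down allocation, splitting processors in proportion to
        # the number of unwritten leaves in the left and right subtrees.
        targets = []
        for r in range(k):
            node, rank, count, size = 1, r, k, n
            while node < n:
                half = size // 2
                left_todo = half - done[2 * node]
                right_todo = half - done[2 * node + 1]
                k_left = count * left_todo // (left_todo + right_todo)
                if rank < k_left:
                    node, count = 2 * node, k_left
                else:
                    node, rank, count = 2 * node + 1, rank - k_left, count - k_left
                size = half
            targets.append(node)
        # W3: work at the leaves reached in W2.
        for leaf in targets:
            x[leaf - n] = 1
            done[leaf] = 1
        # W4: bottom-up progress measurement.
        for node in range(n - 1, 0, -1):
            done[node] = done[2 * node] + done[2 * node + 1]
        iterations += 1
    return x, iterations


x_all, it_all = simulate_w(8, 8, {})                    # no failures
x_one, it_one = simulate_w(8, 8, {0: set(range(1, 8))})  # single survivor
```

In this simplified setting a processor is never sent into a subtree without unwritten leaves, so every iteration with at least one live processor makes progress. With a single survivor the loop degenerates to one leaf per iteration.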


2.2.3.3 Dynamic faults, detected restarts, and algorithm V

Algorithm  W  has  efficient  work  when  subjected  to  arbitrary  failure  patterns 
without restarts and it can be extended to handle restarts. However, since
accurate processor enumeration is impossible if processors can be restarted at any
time, the work of the algorithm becomes inefficient even for some simple
adversaries. On the other hand, the second phase of algorithm W does implement
efficient  top-down  divide-and-conquer  processor  assignment  in  O(logN)  time 
when  permanent  processor  PIDs  are  used.  Therefore  we  produce  a  modified 
version  of  algorithm  W,  that  we  call  V.  To  avoid  a  restatement  of  the  details, 
the  reader  is  referred  to  [18]. 

V  uses  the  optimized  algorithm  W  data  structures  for  progress  estimation 
and  processor  allocation.  The  processors  iterate  through  the  following  three 
phases  based  on  the  phases  W2,  W3  and  W4  of  algorithm  W: 


V1: Processors are allocated as in the phase W2, but using the permanent
PIDs. This assures load balancing in O(log N) time.

V2: Processors perform work, as in the phase W3, at the leaves they reached
in phase V1 (there are log N array elements per leaf).

V3: Processors continue from the phase V2 progress tree leaves and update
the progress tree bottom up as in phase W4 in O(log N) time.

The model assumes re-synchronization on the instruction level, and a
wrap-around counter based on the PRAM clock implements synchronization with
respect to the phases after detected failures [18]. The work and the overhead
ratio of the algorithm are as follows:

Theorem 2.2.3 [18] Algorithm V using P ≤ N processors subject to an
arbitrary failure and restart pattern F of size M has the work
$S = O(N + P \log^2 N + M \log N)$, and its overhead ratio is $\sigma = O(\log^2 N)$.

Algorithm V achieves optimality for a non-trivial set of parameters:

Corollary 2.2.4 Algorithm V with $P \le N / \log^2 N$ processors subject to an
arbitrary failure and restart pattern of size $M \le N / \log N$ has S = O(N).
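
The corollary follows by substituting the two parameter bounds into the work bound of algorithm V. The short derivation below is a sketch, assuming that the work bound reads $S = O(N + P \log^2 N + M \log N)$, that $P \le N / \log^2 N$ and that $M \le N / \log N$:

```latex
\begin{align*}
S &= O\left(N + P \log^2 N + M \log N\right) \\
  &= O\left(N + \frac{N}{\log^2 N} \cdot \log^2 N
            + \frac{N}{\log N} \cdot \log N\right) \\
  &= O(N + N + N) = O(N).
\end{align*}
```

Each of the three terms is thus at most linear in N, which is why both the number of processors and the number of failure and restart events must be bounded for the work to remain optimal.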

One  problem  with  the  above  approach  is  that  there  could  be  a  large  number 
of  restarts  and  a  large  amount  of  work.  Algorithm  V  can  be  combined  with 
algorithm  X  of  the  next  section  or  with  the  asymptotically  better  algorithm  of 
[3]  to  provide  better  bounds  on  work. 

2.2.3.4 Dynamic faults, undetected restarts, and algorithm X

When the failures cannot be detected, it is still possible to achieve a sub-quadratic
upper bound for any dynamic failure/restart pattern. We present Write-All
algorithm X with $S = O(N \cdot P^{\log \frac{3}{2}}) = O(N \cdot P^{0.59})$. This simple algorithm can
be improved to $S = O(N \cdot P^{\epsilon})$ using the method in [3]. We present X for its
simplicity and in the next section a (possible) deterministic version of [3].
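
As a quick numeric check of the exponent, the Python fragment below evaluates the base-2 logarithm of 3/2 and compares the resulting bound with the trivial quadratic bound N · P. It assumes that the exponent in the work bound of algorithm X is the base-2 logarithm of 3/2; the constants hidden by the O-notation are ignored and the function names are hypothetical.

```python
import math

# Exponent of P in the work bound of algorithm X, assuming it is log2(3/2).
EXPONENT_X = math.log2(3 / 2)


def work_bound_x(n, p):
    """Evaluate N * P**log2(3/2), ignoring the constant hidden by O()."""
    return n * p ** EXPONENT_X


def quadratic_bound(n, p):
    """Trivial N * P bound, for comparison."""
    return n * p


# For N = P = 1024 the sub-quadratic bound is far below N * P.
saving = quadratic_bound(1024, 1024) / work_bound_x(1024, 1024)
```

The exponent evaluates to roughly 0.585, which is consistent with the rounded value 0.59 used as an upper bound in the text.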

Algorithm  X  utilizes  a  progress  tree  of  size  N  that  is  traversed  by  the 
processors  independently,  not  in  synchronized  phases.  This  reflects  the  local 
nature  of  the  processor  assignment  as  opposed  to  the  global  assignments  used 
in  algorithms  V  and  W.  Each  processor  searches  for  work  in  the  smallest 
subtree  that  has  work  that  needs  to  be  done.  It  performs  the  work,  and  moves 
to  the  next  subtree. 


01  forall processors PID=0..P-1 parbegin
02    Perform initial processor assignment to the leaves of the progress tree
03    while there is still work left in the tree do
04      if subtree rooted at current node u is done then move one level up
05      elseif u is a leaf then perform the work at the leaf
06      elseif u is an interior tree node then
07        Let u_L and u_R be the left and right children of u respectively
08        if the subtrees rooted at u_L and u_R are done then update u
09        elseif only one is done then go to the one that is not done
10        else move to u_L or u_R according to PID bit values
11      fi fi
12    od
13  parend


Figure  2.2.4:  A  high  level  view  of  the  algorithm  X. 


The  algorithm  is  given  in  Fig.  2.2.4.  Initially  the  P  processors  are  assigned 
to  the  leaves  of  the  progress  tree  (line  02).  The  loop  (lines  03-12)  consists  of 
a  multi-way  decision  (lines  04-11).  If  the  current  node  u  is  marked  done,  the 
processor  moves  up  the  tree  (line  04).  If  the  processor  is  at  a  leaf,  it  performs 
work  (line  05).  If  the  current  node  is  an  unmarked  interior  node  and  both  of 
its  subtrees  are  done,  the  interior  node  is  marked  by  changing  its  value  from 
0  to  1  (line  08).  If  a  single  subtree  is  not  done,  the  processor  moves  down 
appropriately  (line  09).  For  the  final  case  (line  10),  the  processors  move  down 
when neither child is done. Here the processor PID is used at depth h of the
tree node: based on the value of the h-th most significant bit of the binary
representation of PID, bit 0 will send the processor to the left, and bit 1 to
the right.
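
The traversal described above can be sketched as a small sequential simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the progress tree is stored heap-style (root at index 1, leaves at indices n..2n-1), N is a power of two, PIDs are read as log2(N)-bit numbers, processor PID starts at leaf PID * (N / P), and delays are modelled by a hypothetical random scheduler that stalls each processor with probability 1/2 in every round. The names `algorithm_x`, `done` and `pos` are illustrative.

```python
import random


def algorithm_x(n, p, seed=0):
    """Simulate algorithm X on a Write-All instance of size n (a power of two)
    with p <= n processors. Returns (x, work), where work counts loop
    iterations executed by all processors together."""
    height = n.bit_length() - 1            # log2(n); interior depths are 0..height-1
    x = [0] * n                            # the Write-All array
    done = [0] * (2 * n)                   # heap-ordered progress tree, root at 1
    pos = [n + pid * (n // p) for pid in range(p)]   # line 02 (assumed spacing)
    rng = random.Random(seed)
    work = 0
    while not done[1]:                     # line 03: work is left until the root is marked
        # Assumed scheduler: each processor is stalled with probability 1/2;
        # if all happen to be stalled, one arbitrary processor still moves.
        active = [q for q in range(p) if rng.random() < 0.5] or [rng.randrange(p)]
        for pid in active:
            u = pos[pid]
            work += 1
            if done[u]:                                    # line 04: move up
                pos[pid] = max(u // 2, 1)
            elif u >= n:                                   # line 05: work at the leaf
                x[u - n] = 1
                done[u] = 1
            else:
                left, right = 2 * u, 2 * u + 1
                if done[left] and done[right]:             # line 08: mark u
                    done[u] = 1
                elif done[left] != done[right]:            # line 09: go to the undone child
                    pos[pid] = right if done[left] else left
                else:                                      # line 10: use a PID bit
                    depth = u.bit_length() - 1
                    bit = (pid >> (height - 1 - depth)) & 1
                    pos[pid] = right if bit else left
    return x, work
```

Because every decision reads only the shared `done` flags, a stalled processor simply resumes from its recorded position, which is how the sketch treats an undetected restart as a delay.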

The  performance  of  algorithm  X  is  characterized  as  follows: 

Theorem 2.2.5 Algorithm X with P processors solves the Write-All problem
of size N ($P \le N$) in the fail-stop restartable model with work
$S = O(N \cdot P^{\log \frac{3}{2}})$. In addition, there is an adversary that
forces algorithm X to perform $S = \Omega(N \cdot P^{\log \frac{3}{2}})$ work.

The  algorithm  views  undetected  restarts  as  delays,  and  it  can  be  used  in  the 
asynchronous  model  where  it  has  the  same  work  [8].  Algorithm  X  could  also 
be  useful  for  the  case  without  restarts,  even  though  its  worst-case  performance 
without  restarts  is  no  better  than  algorithm  W. 

Open Problem: A major open problem for the model with undetectable
restarts is whether there is a robust Write-All solution, i.e., one whose work
is $N \cdot \mathrm{polylog}(N)$. It is also open whether there is a solution
with $\sigma = \mathrm{polylog}(N)$.


01  forall processors PID = 1..√N parbegin
02    Divide the N array elements into √N work groups of √N elements
03    Each processor obtains a private permutation π_PID of {1, 2, ..., √N}
04    for i = 1..√N do
05      if π_PID(i)th group is not finished
06      then perform sequential work on the √N elements of the group
07           and mark the group as finished
08      fi
09    od
10  parend


Figure  2.2.5:  A  high  level  view  of  the  algorithm  Y. 


2.2.3.5  Dynamic  faults,  undetected  restarts,  and  algorithm  Y 

A  family  of  randomized  Write-All  algorithms  was  presented  by  Anderson  and 
Woll  [3].  The  main  technique  in  these  algorithms  is  abstracted  in  Fig.  2.2.5. 
The  basic  algorithm  in  [3]  is  obtained  by  randomly  choosing  the  permutation 
in line 03. In this case the expected work of the algorithm is $O(N \log N)$, for
$P = \sqrt{N}$ (assume N is a square).

We propose the following way of determinizing the algorithm (see [19]):
Given $P = \sqrt{N}$, we choose the smallest prime m such that P < m. Primes are
sufficiently dense that there is at least one prime between P and 2P, so
the complexity of the algorithm is not distorted when P is not a prime. We
then construct the multiplication table for the numbers 1, 2, ..., m-1 modulo m.
Each row of this table is a permutation and this structure is a group. The
processor with PID i uses the ith permutation as its schedule.

This table need not be pre-computed, as any item can be computed directly
by any processor with the knowledge of its PID and the number of work
elements w it has processed thus far, as (PID · w) mod m.
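
The on-the-fly schedule computation can be illustrated as follows. This is a minimal sketch: the function names are hypothetical, the prime search uses plain trial division (adequate only for small P), and how a processor treats table entries larger than the number of work groups is not specified in the text, so the sketch only produces the rows of the table.

```python
def smallest_prime_above(p):
    """Smallest prime m with m > p, for p >= 1 (trial division)."""
    m = p + 1
    while any(m % d == 0 for d in range(2, int(m ** 0.5) + 1)):
        m += 1
    return m


def schedule(pid, m):
    """Row `pid` of the multiplication table modulo the prime m:
    the w-th entry is (pid * w) mod m, for w = 1..m-1."""
    return [(pid * w) % m for w in range(1, m)]


# Example: P = 4 processors use the prime m = 5.
m = smallest_prime_above(4)
rows = {pid: schedule(pid, m) for pid in range(1, m)}
```

Each row is a permutation of {1, ..., m-1} because multiplication by a nonzero element modulo a prime is a bijection, so no processor needs to store its schedule.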

Conjecture:  We  conjecture  that  the  worst  case  work  of  this  deterministic 
algorithm  is  no  worse  than  the  expected  work  of  the  randomized  algorithm. 
Experimental  analysis  supports  the  conjecture.  Formal  analysis  can  be  reduced 
to  the  open  problem  below  that  contains  an  interesting  group-theoretic  aspect 
of  the  multi-processor  scheduling  problem  [41].  In  order  to  show  that  the  worst 
case work of Y is $O(N \log N)$, it is sufficient to show that:

Given a prime m, consider the group $G = (\{1, 2, \ldots, m-1\}, \cdot \pmod{m})$.
The multiplication table for G, when the rows of the table are interpreted as
permutations of $\{1, \ldots, m-1\}$, is a group K of order m - 1 (a subgroup of
all permutations). Show that, for each left coset of K (with respect to all
permutations) the sum of the number of left-to-right maxima of all elements
of the coset is $O(m \log m)$.
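
For small primes the quantity in this open problem can be explored by brute force. The sketch below is only an experiment aid under stated assumptions: a permutation g is represented as a tuple mapping the value v to g[v-1], a left coset is taken to be gK = {g∘k : k in K} with (g∘k)(w) = g(k(w)), and the function names are hypothetical. It says nothing about the asymptotic bound itself.

```python
from itertools import permutations


def lr_maxima(seq):
    """Number of left-to-right maxima of a sequence of positive integers."""
    count, best = 0, 0
    for v in seq:
        if v > best:
            count, best = count + 1, v
    return count


def coset_sum(g, m):
    """Sum of left-to-right maxima over the coset gK, where K consists of the
    rows of the multiplication table modulo the prime m."""
    total = 0
    for pid in range(1, m):
        row = [(pid * w) % m for w in range(1, m)]      # element k of K
        total += lr_maxima([g[v - 1] for v in row])     # g composed with k
    return total


def worst_coset_sum(m):
    """Maximum of coset_sum over all permutations g (feasible only for tiny m)."""
    return max(coset_sum(g, m) for g in permutations(range(1, m)))
```

For m = 5 and the identity permutation, the four rows [1,2,3,4], [2,4,1,3], [3,1,4,2] and [4,3,2,1] have 4, 2, 2 and 1 left-to-right maxima respectively.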


01  forall processors PID=1..P parbegin  -- P processors clear N locations
02    Clear the initial block of N_0 = G_0 elements sequentially using P processors
03    i := 0  -- Iteration counter
04    while N_i < N do
05      Use Write-All solution with data structures of size N_i and G_{i+1} elements
06      at the leaves to clear memory of size N_{i+1} = N_i · G_{i+1};  i := i + 1
07    od
08  parend

Figure  2.2.6:  A  high  level  view  of  algorithm  Z. 

2.2.3.6  Bootstrapping  and  algorithm  Z 

The Write-All algorithms and simulations (e.g., [17, 22, 23, 40]) or the
algorithms that can serve as Write-All solutions (e.g., the algorithms in [9, 32])
invariably  assume  that  a  linear  portion  of  shared  memory  is  either  cleared  or 
is  initialized  to  known  values.  Starting  with  a  non-contaminated  portion  of 
memory,  these  algorithms  perform  their  computation  by  “consuming”  the  clear 
memory,  and  concurrently  or  subsequently  clearing  segments  of  memory  needed 
for  future  iterations.  We  define  an  efficient  Write-All  solution  that  requires  no 
clear  shared  memory  [42]. 

The solution uses a bootstrap approach: In stage 1 all P processors clear
an initial segment of $N_0$ locations in the auxiliary memory. In stage i the P
processors clear $N_{i+1} = N_i \cdot G_{i+1}$ memory locations using $N_i$ memory
locations that were cleared in stage i - 1.

Using algorithm W and tuning the parameters $N_i$ and $G_i$ we obtain a
solution (algorithm Z, see Fig. 2.2.6) that for any failure pattern F
($|F| < P$) has work $O(N + P \log^3 N / (\log\log N)^2)$ without any
initialization assumption.
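
The growth of the cleared region over the bootstrap stages can be sketched as below. This is an illustration only: the actual parameters N_i and G_i are tuned in [42] and are not given here, so the sketch assumes a constant, hypothetical group size g for every stage, and the function name is illustrative.

```python
def bootstrap_stages(n, g):
    """Sizes N_0, N_1, ... of the cleared memory region, assuming every group
    size G_i equals the constant g >= 2 (an assumption of this sketch)."""
    sizes = [g]                          # N_0 = G_0, cleared sequentially
    while sizes[-1] < n:                 # loop of Fig. 2.2.6: while N_i < N
        sizes.append(sizes[-1] * g)      # N_{i+1} = N_i * G_{i+1}
    return sizes
```

With a constant group size the cleared region grows geometrically, so the number of stages is logarithmic in N; the tuned parameters of the actual algorithm may grow differently.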

A similar algorithm that inverts the bootstrap procedure can be used to clear
the contaminated shared memory if the output must contain only the results
of the intended computation. The complexity of algorithm $Z^{-1}$ is identical to
the complexity of algorithm Z. For algorithm simulation and for transformed
algorithms, the complexity cost is additive in both cases.

2.2.3.7 Minimizing concurrency: processor priority trees

Among the key lower bound results is the fact that no efficient fault-tolerant
CREW PRAM Write-All algorithms exist [17]: if the adversary is dynamic then
any P-processor solution for the Write-All problem of size N will have
(deterministic) work $\Omega(N \cdot P)$. Thus memory access concurrency is
necessary to combine efficiency and fault-tolerance. However, while most known
solutions for the Write-All problem indeed make heavy use of concurrency, the
goal of minimizing concurrent access to shared memory is attainable.

We  gave  a  Write-All  algorithm  in  [16]  in  which  we  bound  the  total  amount 
of  concurrency  used  in  terms  of  the  number  of  dynamic  processor  faults  of  the 
actual  run  of  the  algorithm. 

When there are no faults our algorithm executes as an EREW PRAM, and when
there are faults the algorithm departs from EREW by an amount of concurrency
proportional to the number of faults. The algorithm is based on a conservative
policy: concurrent reads or writes occur only when the presence of failures
can be inferred, and then concurrency is allowed in proportion to the failures
detected.

The robust CRCW algorithm $W_{CR/W}$ in [16] is based on algorithm W and it
uses processor identifiers to construct mergeable processor priority trees (PPT),
which control concurrent access to memory. During the execution, the PPTs
are compacted and merged to remove faulty processors and to determine when
concurrent access to memory is warranted.

By taking advantage of parallel slackness and by clustering the input data
into groups of size $\log N \log P$, we obtain an algorithm that has a range of
optimality and that controls its memory access concurrency:

Theorem 2.2.6 Algorithm $W_{CR/W}$ [16] with input clustering is a robust
Write-All algorithm with $S = O(N + P \log^2 N \log^2 P / \log\log N)$, write
concurrency $\omega \le |F|$, and read concurrency $\rho \le 7|F| \log N$,
where $1 \le P \le N$.

The basic algorithm can be extended to handle arbitrary initial memory
contents [16]. It is also possible to reduce the maximum per step memory
access concurrency by polylogarithmic factors by deploying a general pipelining
technique. Finally, [16] shows that there is no robust algorithm whose total
write concurrency is bounded by $|F|^{\varepsilon}$ for $0 \le \varepsilon < 1$.

2.2.4  Computing  functions  robustly 

In  this  section  we  will  work  our  way  from  the  simplest  to  the  most  complicated 
functions  with  robust  solutions. 

2.2.4.1 Constants, booleans and Write-All

Solving a Write-All problem of size N can be viewed as computing a constant
vector function. Constant scalar functions are the simplest possible functions
(e.g., simpler than boolean OR and AND). At the same time, it appears
that the Write-All problem is a more difficult (vector) task than computing scalar
boolean functions such as multiple input OR and AND. In the lower bounds
discussion we consider a model with memory snapshots, i.e., processors can read
and process the entire shared memory in unit time. For the snapshot model
there is a sharp separation between Write-All and boolean functions. Clearly
any boolean can be computed in constant time in the snapshot model, while
we have a lower bound result for any Write-All solution in the snapshot model
requiring work $\Omega(N \log N / \log\log N)$.

Solving  a  Write-All  problem  is  no  more  difficult  than  computing  any  other 
vector  function,  e.g.,  parallel  prefix.  In  the  next  subsection  we  also  show  that 
the  best  (as  of  this  writing)  Write-All  solution  can  be  used  to  derive  a  robust 
parallel  prefix  algorithm  that  has  the  same  work  complexity. 

2.2.4.2 Parallel prefix and Write-All

Solutions for the Write-All problem can be used as building blocks for custom
transformations of efficient parallel algorithms into robust algorithms [17].
Transformations are of interest because in some cases it is possible to improve
on the work of oblivious simulation such as [23, 32, 40]. These improvements
are most significant for fast algorithms when a full range of processors is used,
i.e., when N processors are used to simulate N processors, because in this case
parallel slack cannot be taken advantage of.

One immediate result that improves on the available general simulations
follows from the fact that algorithms V, W and X, by their definition, implement
an associative operation on N values.

Theorem 2.2.7 Given any associative operation $\oplus$ on integers, and an integer
array x[1..N], it is possible to robustly compute $\bigoplus_{i=1}^{N} x[i]$ using P
fail-stop processors at a cost of a single application of any of the algorithms
V, W or X.
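
The idea behind this theorem can be sketched by letting each progress tree node hold the combination of its two children instead of a done flag. The code below is a failure-free, sequential illustration under stated assumptions (heap-ordered tree, N a power of two, hypothetical function name); it is not one of the algorithms V, W or X.

```python
def tree_reduce(op, xs):
    """Combine xs (length a power of two) with the associative operation op by
    filling a heap-ordered tree bottom up. The value of an interior node depends
    only on its children, so recomputing a node any number of times, as
    reassigned processors might, does not change the result."""
    n = len(xs)
    val = [None] * (2 * n)
    val[n:] = xs                                    # leaves hold the inputs
    for u in range(n - 1, 0, -1):                   # bottom-up update of the tree
        val[u] = op(val[2 * u], val[2 * u + 1])     # left operand first: order is preserved
    return val[1]
```

Only associativity is used: the left-to-right order of the operands is kept, so the operation need not be commutative.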

This saves a full log N factor for all simulations. The savings are also
possible for the important prefix sums and pointer doubling algorithms. Efficient
parallel algorithms and circuits for computing prefix sums were given by Ladner
and Fischer in [26], where the prefix problem is defined as follows: Given an
associative operation $\oplus$ on a domain $\mathcal{D}$, and
$x_1, \ldots, x_n \in \mathcal{D}$, compute, for each k ($1 \le k \le n$), the
sum $\bigoplus_{i=1}^{k} x_i$.

In order to compute the prefix sums of N values using N processors, at least
$\log N / \log\log N$ parallel steps are required [6, 27], and the known algorithms
require at least log N steps. Therefore an oblivious simulation of a known prefix
algorithm will require simulating at least log N steps. When using P = N
processors with algorithm W (the most efficient Write-All solution as of this
writing), whose work is $S_W = O(N \log^2 N / \log\log N)$, the work of the
simulation will be $O(S_W \cdot \log N)$.

We  can  extend  Theorem  2.2.7  to  show  a  robust  prefix  algorithm  whose  work 
is  the  same  as  that  of  algorithm  W.  In  the  fail-stop  model  we  have  the  following 
result  that  uses  as  the  basis  an  iterative  version  of  the  recursive  algorithm  of  [26]: 

Theorem 2.2.8 Parallel prefix for N values can be computed using N fail-stop
processors using O(N) clear memory with $S = O(N \log^2 N / \log\log N)$.
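
The structure of a tree-based prefix computation can be sketched with one bottom-up and one top-down sweep over a summation tree. This is a generic, failure-free illustration under stated assumptions (heap-ordered tree, N a power of two, hypothetical names); it is not claimed to be the iterative version of the algorithm of [26] used in the theorem.

```python
def prefix(op, xs):
    """Inclusive prefix results of xs (length a power of two) under the
    associative operation op, using a heap-ordered summation tree."""
    n = len(xs)
    total = [None] * (2 * n)
    total[n:] = xs
    for u in range(n - 1, 0, -1):                    # bottom-up sweep: subtree totals
        total[u] = op(total[2 * u], total[2 * u + 1])
    # before[u] combines everything to the left of u's subtree (None = nothing).
    before = [None] * (2 * n)
    for u in range(1, n):                            # top-down sweep
        before[2 * u] = before[u]
        left = total[2 * u]
        before[2 * u + 1] = left if before[u] is None else op(before[u], left)
    return [xs[i] if before[n + i] is None else op(before[n + i], xs[i])
            for i in range(n)]
```

Both sweeps touch each tree node once, which is the access pattern a progress tree traversal already provides.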

A  similar  approach  was  also  taken  by  Martel  et  al.  [30]  to  produce  an  efficient 
randomized  transformation  of  the  prefix  algorithm. 

2.2.4.3 List ranking

Another important improvement for the fail-stop case is for the pointer
doubling operation that is used in many parallel algorithms. The robust algorithm
is implemented using a variation of algorithm W and the standard pointer
doubling algorithm. We associate each list element with a progress tree leaf.
In the work phase of algorithm W we double pointers and update distances.
The log N pointer doubling operations in the work phase make
$\log N / \log\log N$ overall iterations sufficient, with each iteration performing
the same work $S_W$ as algorithm W.
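
For reference, the standard (failure-free) pointer doubling step that the work phase repeats can be sketched as follows. The representation is an assumption of this sketch: `nxt[i]` is the successor of element i and the tail points to itself; the function name is hypothetical. Each round builds new arrays from the old ones, mirroring a synchronous parallel step.

```python
def list_rank(nxt):
    """Distance of every list element from the tail, by pointer doubling.
    nxt[i] is the successor of element i; the tail satisfies nxt[i] == i."""
    n = len(nxt)
    nxt = list(nxt)
    dist = [0 if nxt[i] == i else 1 for i in range(n)]
    rounds = max(1, (n - 1).bit_length())           # about log2(n) doubling rounds
    for _ in range(rounds):
        # Both updates read only the previous round's arrays.
        dist = [dist[i] + dist[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]
    return dist
```

Extra rounds are harmless because the tail has distance 0 and points to itself.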

Theorem 2.2.9 There is a robust list ranking algorithm for the fail-stop model
with $S = O((\log N / \log\log N) \cdot S_W(N, P))$, where N is the input list size
and $S_W(N, P)$ is the complexity of algorithm W for the initial number of
processors P, $1 \le P \le N$.

This improvement can be used with several algorithms based on pointer
doubling, e.g., algorithms for computing the tree functions of Tarjan and Vishkin [43].
Note also that by preceding the algorithm with log N pointer doubling
operations with $O(N \log N)$ additive overhead, we obtain a solution that has no
asymptotic degradation in the absence of failures.

2.2.4.4  General  Parallel  Assignment 

Consider computing and storing in an array x[1..N] the values of a vector
function f that depend on PIDs and the initial values of the array x. Assume
each of the N scalar components of f can be computed in O(1) sequential time.
This is the general parallel assignment problem.


In [17] a general technique was shown for making this operation robust using
the same work as required by Write-All. We modify the assignment so that it
remains correct when processors fail and when multiple attempts are made to
execute the assignment (assuming the surviving processors can be reassigned
to the tasks of faulty processors). This is done using binary version numbers
and two generations of the array:

forall processors PID = 1..N parbegin
  shared integer array x[0..1][1..N];
  bit integer v;
  x[v + 1][PID] := f(PID, x[v][1..N]);
parend

Here, bit v is the current version number or tag (mod 2), so that x[v][1..N]
is the array of current values. Function f will use only these values of x as its
input. The values of f are stored in x[v+1][1..N], creating the next generation
of array x. After all the assignments are performed, the binary version number
is incremented (mod 2).
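
The effect of the two generations can be seen in a small sketch in which the same assignment is attempted several times, as happens when surviving processors take over the tasks of failed ones. The names `robust_assign` and `attempts` are hypothetical and indices are 0-based; the sketch illustrates the versioning idea rather than the construction of [17].

```python
def robust_assign(f, x, v, attempts):
    """Execute x[next][i] := f(i, x[v]) for every index i listed in `attempts`.
    `attempts` may repeat indices (re-execution after failures) but must cover
    every index at least once. Returns the new version bit."""
    cur, nxt = x[v], x[1 - v]
    if set(attempts) < set(range(len(cur))):
        raise ValueError("some assignment was never attempted")
    for i in attempts:
        # f reads only the current generation, so repeating an attempt
        # rewrites the same value and is harmless.
        nxt[i] = f(i, cur)
    return 1 - v                         # increment the version number mod 2


# Example: each element becomes the sum of itself and its right neighbour.
x = [[1, 2, 3, 4], [0, 0, 0, 0]]
v = robust_assign(lambda i, cur: cur[i] + cur[(i + 1) % len(cur)],
                  x, 0, [0, 1, 1, 2, 0, 3, 3])
```

Had the assignment been done in place on a single array, the repeated attempts for indices 0, 1 and 3 could have read already-updated neighbours and produced different values.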

At this point, a simple transformation of any Write-All algorithm, with the
modified general parallel assignment replacing the trivial "x[i] := 1" assignment,
will yield a robust N-processor algorithm:

Theorem 2.2.10 The asymptotic work complexities of solving the general
parallel assignment problem and the Write-All problem are equal.


2.2.4.5  Any  PRAM  steps 

The original motivation for studying the Write-All problem was that it captured
the  essence  of  a  single  PRAM  step  computation.  It  was  shown  in  [23,  40]  how  to 
use  the  Write-All  paradigm  in  implementing  general  PRAM  simulations.  The 
generality  of  this  result  is  somewhat  surprising. 

Fail-stop faults: An approach to such simulations is given in Fig. 2.2.7.
The  simulations  are  implemented  by  robustly  executing  each  of  the  cycles  of 
the  PRAM  step:  instruction  fetch,  read,  compute,  and  write  cycles,  and  next 
instruction  address  computation.  This  is  done  using  two  generations  of  shared 


01  forall processors PID=1..P parbegin  -- Simulate N fault-prone processors
02    The PRAM program for N processors is in shared memory (read-only)
03    Shared memory has two generations: current and future;
04    Initialize N simulated instruction counters to start at the first instruction
05    while there is a simulated processor that has not halted do
06      -- Tentative computation: Fetch instruction; Copy registers to scratchpad
07      Perform read cycle using current memory
08      Perform the compute cycle using scratchpad
09      Perform write cycle into future memory
10      Compute next instruction address
11      -- Reconcile memory and registers: Copy future locations to current
12    od
13  parend

Figure 2.2.7: Simulations using Write-All primitive.

memory, "current" and "future", and by executing each of these cycles in the
general parallel assignment style, e.g., using algorithm W.

Using such techniques it was shown in [23, 40] that if $S_w(N, P)$ is the
efficiency of solving a Write-All instance of size N using P processors, and if a
linear amount of clear memory is available, then any N-processor PRAM step can
be deterministically simulated using P fail-stop processors and work $S_w(N, P)$.
If the Parallel-time × Processors of an original N-processor algorithm is
$\tau \cdot N$, then the work of the fault-tolerant simulation will be
$O(\tau \cdot S_w(N, P))$.

The simulation in the fail-stop model is optimal for a wide range of
processors [40]. The following theorem might have some practical significance, given
the constant overhead.

Theorem 2.2.11 Any N-processor PRAM algorithm can be optimally simulated
(with constant overhead) on a fail-stop P-processor CRCW PRAM, when
$P \le N \log\log N / \log^2 N$. EREW, CREW, and WEAK and COMMON CRCW PRAM
algorithms are simulated on fail-stop COMMON CRCW PRAMs; ARBITRARY,
PRIORITY and STRONG CRCW PRAMs are simulated on fail-stop PRAMs of the same
type.

When the full range of simulating processors is used (N = P) optimality is
not achievable. In this case customized transformations of parallel algorithms
(such as our prefix and list ranking algorithms) may improve on the oblivious
simulations.

Note that Theorem 2.2.11 also holds when failed processors are restarted
during the simulation between the individual Write-All steps.


Initial faults: Algorithm E can be used for simulations of EREW PRAM
algorithms on fail-stop EREW PRAMs [16]. Simulations are much simpler for this case
as compared to the dynamic failures case. The computational overhead of such
simulations is additive. This simulation is optimal when
$P \cdot \tau = \Omega(P' \log P)$.

Theorem 2.2.12 Any P-processor, $\tau$ parallel time EREW PRAM algorithm can
be robustly simulated on a fail-stop EREW PRAM that is subject to static initial
processor and memory faults. The work of the simulation is
$P \cdot \tau + O(P' \log P)$, where $P'$ is the number of live processors.

Fail-stop faults with detectable restarts: There is a broad range of parameters
for which the work performed in executing a parallel algorithm on a faulty PRAM
is asymptotically equal to the Parallel-time × Processors product for that
algorithm.

Theorem 2.2.13 Any N-processor PRAM algorithm can be executed on a fail-stop
P-processor CRCW PRAM with detectable restarts, with $P \le N$. Each
N-processor PRAM step is executed in the presence of any pattern F of failures
and restarts of size M with
$S = O(\min\{N + P \log^2 N + M \log N,\; N \cdot P^{\log \frac{3}{2}}\})$,
and overhead ratio $\sigma = O(\log^2 N)$. EREW, CREW, and WEAK and COMMON
CRCW PRAM algorithms are simulated on fail-stop COMMON CRCW PRAMs;
ARBITRARY and STRONG CRCW PRAMs are simulated on fail-stop PRAMs of the
same type.

Fail-stop faults with undetectable restarts: When the failures are undetectable,
deterministic simulations become difficult due to the possibility of processors
delayed due to failures writing stale values to shared memory. Fortunately,
for fast polylogarithmic time parallel algorithms we can solve this problem by
using polylogarithmically more memory. We simply provide as many "future"
generations of memory as there are PRAM steps to simulate. Processor registers
are stored in shared memory along with each generation of shared memory.

Prior  to  starting  a  parallel  step  simulation,  a  processor  uses  binary  search 
to  find  the  newest  simulated  step.  When  reading,  a  processor  linearly  searches 
past  generations  of  memory  to  find  the  latest  written  value.  In  the  result  below 
we  use  the  existential  algorithm  [3]. 
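
The two searches described above can be sketched as follows. The memory layout is an assumption made for this illustration only: `step_done[t]` is a flag that is true once simulated step t has completed (steps complete in order), `gens[t][addr]` holds the value written to `addr` by step t or `None` if that step did not write it, and `initial` is the memory before step 0. All names are hypothetical.

```python
def newest_step(step_done):
    """Number of completed simulated steps, found by binary search.
    Assumes step_done is a run of True values followed by False values."""
    lo, hi = 0, len(step_done)
    while lo < hi:
        mid = (lo + hi) // 2
        if step_done[mid]:
            lo = mid + 1
        else:
            hi = mid
    return lo


def read(gens, initial, addr, t):
    """Value of location addr as seen by simulated step t: the latest value
    written by generations t-1, t-2, ..., 0, else the initial contents."""
    for s in range(t - 1, -1, -1):           # linear search through past generations
        if gens[s][addr] is not None:
            return gens[s][addr]
    return initial[addr]
```

A delayed processor can only write into the generation of the step it was simulating, so a stale write cannot overwrite a value belonging to a later step.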

Theorem 2.2.14 Any N-processor, $\log^{O(1)} N$-time, M-memory PRAM
algorithm can be deterministically executed on a fail-stop P-processor CRCW PRAM
($P \le N$) with undetectable restarts, and using shared memory
$M \cdot \log^{O(1)} N$. Each N-processor PRAM step is executed in the presence of
any pattern F of failures and undetected restarts with $S = O(N^{1+\varepsilon})$.


2.2.5  Computing  relations  and  approximate  Write-All 

Here  we  show  that  computing  some  relations  robustly  is  easier  than  computing 
functions  robustly. 

Consider the majority relation M: Given an array x[1..N], $x \in M$ when
$|\{i : x[i] = 1\}| > \frac{1}{2} N$. C. Dwork observed that the
$\Omega(N \log N)$ lower bound [22] on solving Write-All using N processors also
applies to producing a member of M in the presence of failures. It turns out that
$O(N \log N)$ work is also sufficient to compute a member of the majority relation.

Let us parameterize the majority problem in terms of the approximate
Write-All problem by using a quantity $\varepsilon$ such that
$0 < \varepsilon < \frac{1}{2}$; thus we would like to initialize at least
$(1 - \varepsilon) N$ array locations to 1. We call this problem AWA($\varepsilon$).
Surprisingly, algorithm W has the desired property:

Theorem 2.2.15 Given any constant $\varepsilon$ such that
$0 < \varepsilon < \frac{1}{2}$, algorithm W solves the AWA($\varepsilon$) problem
with $S = O(N \log N)$ using N processors.

If we choose $\varepsilon = 1/2^k$ (k = const) and then iterate this Write-All
algorithm $\log\log N$ times, the number of unvisited leaves will be
$N \varepsilon^{\log\log N} = N (\log N)^{\log \varepsilon} = N (\log N)^{-k} = N / \log^k N$.
Thus we can get even closer to solving the Write-All problem:

Theorem 2.2.16 For each k = const, there is a robust AWA($1/\log^k N$)
algorithm that has work $S = O(N \log N \log\log N)$.
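
The accounting behind this bound can be written out as a short worked derivation. It assumes, as in the preceding discussion, that each iteration is one run of an AWA($\varepsilon$) algorithm with $\varepsilon = 1/2^k$ costing $O(N \log N)$ work, that $\log\log N$ iterations are performed, and that logarithms are base 2:

```latex
\[
\varepsilon^{\log\log N} = \left(2^{-k}\right)^{\log\log N}
  = \left(2^{\log\log N}\right)^{-k} = (\log N)^{-k},
\qquad\text{so}\qquad
N \varepsilon^{\log\log N} = \frac{N}{\log^k N},
\]
\[
S = \underbrace{\log\log N}_{\text{iterations}} \cdot
    \underbrace{O(N \log N)}_{\text{work per iteration}}
  = O(N \log N \log\log N).
\]
```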

2.2.6  Lower  bounds 

The strongest known lower bound for Write-All was derived by Kedem, Palem,
Raghunathan and Spirakis in [22].

Theorem 2.2.17 [22] Given any P-processor CRCW PRAM algorithm for the
Write-All problem of size N, an adversary can force fail-stop (no restart) errors
that result in $N + \Omega(P \log N)$ (where $P \le N$) steps being performed.

Recently,  Martel  and  Subramonian  [31]  have  extended  the  Kedem  et  al. 
deterministic  lower  bound  [22]  to  randomized  algorithms  against  oblivious  ad¬ 
versaries.  It  is  open  whether  this  lower  bound  applies  to  the  static  fault  case. 

It was shown in [17] that no optimal solutions for the Write-All problem
exist that use the range of processors $1 \le P \le N$ even when the processors
can take instant memory snapshots, i.e., processors can read and locally process
the entire shared memory at unit cost. The lower bound below applies to fail-stop,
deterministic or randomized, PRAMs and it is the strongest possible bound under
the memory snapshots assumption, i.e., there is a matching upper bound.

Theorem 2.2.18 [17] Given any N-processor CRCW PRAM algorithm for the
Write-All problem of size N, an adversary can force fail-stop errors that result
in $\Omega(N \log N / \log\log N)$ steps being performed, even if the processors
can read and locally process all shared memory at unit cost.

When  restarts  are  introduced,  we  show  the  following  result  that  also  is  the 
strongest  possible  result  under  the  snapshot  assumption  [8]: 

Theorem 2.2.19 Given any P-processor CRCW PRAM algorithm that solves
the Write-All problem of size N ($P \le N$), an adversary (that can cause
arbitrary processor failures and restarts) can force the algorithm to perform
$N + \Omega(P \log P)$ work steps.

The next result shows that CRCW is necessary to achieve efficient solutions
to the Write-All problem. In the absence of failures, any P-processor CREW
(concurrent read exclusive write) or EREW (exclusive read exclusive write) PRAM
can simulate a P-processor CRCW PRAM with only a factor of $O(\log P)$ more
parallel work [20]. However a more severe difference exists between CRCW and
CREW PRAMs (and thus also EREW PRAMs) when the processors are subject to
failures.

Theorem 2.2.20 Given any deterministic or randomized N-processor CREW
PRAM algorithm for the Write-All problem, the adversary can force fail-stop
errors that result in $\Omega(N^2)$ steps being performed, even if the processors
can read and locally process all shared memory at unit cost.

For the CREW PRAMs, Martel and Subramonian [31] show a randomized
algorithm with expected work of only $O(N \log N)$ for P = N.

2.2.7 A complexity classification

2.2.7.1  Efficient  parallel  computation 

Many efficient parallel algorithms can be used to show problem membership in
the class $\mathcal{NC}$ (of polylog time and polynomial number of processors [35]).
The inverse is not necessarily true. This is because the algorithms in
$\mathcal{NC}$ allow for polynomial inefficiency in work [25]: the algorithms are
fast (polylogarithmic time), but the computational agent can be large
(polynomial) relative to the size of a problem [35].

A characterization of parallel algorithm efficiency that takes into account
both the parallel time and the size of the computational resource is defined by
Vitter and Simons [44] and expanded on by Kruskal et al. [25]. The complexity
classes in [25] are defined with respect to the time complexity $T(N)$ of the best
sequential algorithm for a problem of size N; this is analogous to the definition
of robustness. Each class is characterized in terms of parallel time $\tau(N)$ and
parallel work $\tau(N) \cdot P(N)$. We give these class definitions below, but
instead of failure-free work, we use the overhead ratio $\sigma$ that for the
failure-free case is simply $\tau(N) \cdot P(N) / T(N)$:

Let A be a problem with sequential (RAM) time complexity $T(N)$. A parallel
algorithm that solves an N-size instance of A using $P(N)$ processors in $\tau(N)$
time belongs to the class:


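ENC:  if $\tau(N) = \log^{O(1)}(T(N))$ and $\sigma = O(1)$.

EP:   if $\tau(N) \le T(N)^{\varepsilon}$ (const $\varepsilon < 1$) and $\sigma = O(1)$.

ANC:  if $\tau(N) = \log^{O(1)}(T(N))$ and $\sigma = \log^{O(1)}(T(N))$.

AP:   if $\tau(N) \le T(N)^{\varepsilon}$ (const $\varepsilon < 1$) and $\sigma = \log^{O(1)}(T(N))$.

SNC:  if $\tau(N) = \log^{O(1)}(T(N))$ and $\sigma = T(N)^{O(1)}$.

SP:   if $\tau(N) \le T(N)^{\varepsilon}$ (const $\varepsilon < 1$) and $\sigma = T(N)^{O(1)}$.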

2.2.7.2 Closures under failures

We  now  define  criteria  for  evaluating  whether  algorithm  transformation  pre¬ 
serves  the  efficiency  of  the  algorithms  for  each  of  the  classes  above. 

To use time complexity in comparisons, we need to introduce a measure of
time for the fault-tolerant algorithms. In a fault-prone environment, a time
metric is meaningful provided that a significant number of processors still are
active. Here we use the worst case time provided a linear number of processors
are active during the computation. This is our weak survivability assumption.
Without this assumption, all one can conclude about the running time is that
it is no better than the time of the best sequential algorithm, since the number
of active processors might become quite small.

We assume P is a polynomial in N (note that until now we generally
assumed $P \le N$). Then $\log P = O(\log N)$. We now state the definition:


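Complexity | Time with $\ge cP$ processors:          | Overhead:                         | Closed
class      | $O(\tau(N) \log^2 N / \log\log N)$      | $\sigma \cdot O(\log^{O(1)} N)$   | under $\phi$?
-----------+-----------------------------------------+-----------------------------------+--------------
ENC        | $= O(\log^{O(1)}(T(N)))$                | $> O(1)$                          | No
EP         | $= O(T(N)^{\varepsilon})$               | $> O(1)$                          | No
ANC        | $= \log^{O(1)}(T(N))$                   | $= \log^{O(1)}(T(N))$             | Yes
AP         | $= O(T(N)^{\varepsilon})$               | $= \log^{O(1)}(T(N))$             | Yes
SNC        | $= \log^{O(1)}(T(N))$                   | $= T(N)^{O(1)}$                   | Yes
SP         | $= O(T(N)^{\varepsilon})$               | $= T(N)^{O(1)}$                   | Yes

Table 2.2.1: Closure under the fail-stop transformation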


Definition  2.2.6  Let  Cr,w  be  a  class  with  parallel  time  in  the  complexity  class 
r  and  parallel  work  in  the  complexity  class  w.  We  say  that  is  closed  with 
respect  to  a  fault-tolerant  transformation  if  for  any  algorithm  A  in  Cr,^: 
(1)  overhead  er  of  ^(.4)  is  such  that  a  •  t  •  P  ia  in  w,  imd  (2)  when  the  number 
of  active  processors  at  any  point  of  the  computation  is  at  least  cP  for  constant 
c>  0,  then  the  running  time  t  is  in  r.  □ 

In the fail-stop model without restarts, given any algorithm A, let $\phi(A)$ be
the fault-tolerant algorithm that can be constructed as either a simulation or a
transformation.

Using, for example, algorithm W as the basis for transforming
non-fault-tolerant algorithms, we have the following:

(1) The multiplicative overhead in work is $O(\log^2 N/\log\log N)$, and so the
worst case overhead $\sigma$ is $O(\log^2 N/\log\log N) = \log^{O(1)} N$ and the worst case
work of the fault-tolerant version $\phi(A)$ is $\sigma \cdot \tau(N) \cdot P$.

(2) Algorithm W terminates in $S/cP = O(\log^2 N/\log\log N)$ time when at
least $cP$ processors are active; therefore, if the parallel time of algorithm A
is $\tau(N)$, then the parallel time of execution for $\phi(A)$ using at least $cP$ active
processors is $O(\tau(N)\log^2 N/\log\log N)$.

The resulting closure properties of the classes in [25] under our fail-stop
transformation are summarized in Table 2.2.1.
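To get a feel for the size of the overhead introduced by a transformation based on algorithm W, the following sketch evaluates the factor $\log^2 N/\log\log N$ numerically. It is only an illustration: the function name `fail_stop_overhead_factor` is hypothetical, logarithms are taken base 2, and the constants hidden by the O-notation are dropped.

```python
import math


def fail_stop_overhead_factor(n: int) -> float:
    """log^2 N / log log N, base-2 logarithms, constant factors dropped."""
    lg = math.log2(n)
    return lg * lg / math.log2(lg)


for k in (10, 20, 40):
    n = 2 ** k
    factor = fail_stop_overhead_factor(n)
    # Compare against a fixed polylog envelope (log^2 N) and a polynomial one (sqrt N).
    print(f"N = 2^{k}: factor = {factor:.2f}, "
          f"log^2 N = {k * k}, sqrt(N) = {2 ** (k // 2)}")
```

The factor keeps growing with N, so a bound of the form $\sigma = O(1)$ cannot absorb it, while it always stays below $\log^2 N$, so a polylogarithmic or polynomial overhead bound is unaffected. This shows only the upper-bound side of the argument; non-closure itself rests on lower bounds, not on this calculation.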

In the fail-stop model with detectable restarts, for any algorithm A, let
$\rho(A)$ be the fault-tolerant algorithm constructed using any of our techniques.
In this model we provide existential closure properties by taking advantage of
the existential result by Anderson and Woll [3], who showed that for every
$\epsilon > 0$, there exists a deterministic algorithm for P processors that simulates P
instructions with $O(P^{1+\epsilon})$ work. Given the algorithm of [3], we interleave it with
algorithm V, for example, so that the overhead $\sigma$ of the combined algorithm
is $O(\log^2 N)$. Table 2.2.2 gives the closure properties under the restartable
fail-stop transformation. Note that due to the lower bounds for the Write-All
problem, the entries that are marked “No” mean non-closure, while the
“Unknown” result means that closure is not achieved with the known results.


Complexity  Time with $\ge cP$       Overhead $\sigma$:       Closed
class       processors               $O(\log^2 N)$            under $\rho$?
----------  -----------------------  -----------------------  -------------
ENC         $> \log^{O(1)}(T(N))$    $> O(1)$                 No
EP          $= O(T(N)^{\epsilon})$   $> O(1)$                 No
ANC         $> \log^{O(1)}(T(N))$    $= \log^{O(1)}(T(N))$    Unknown
AP          $= O(T(N)^{\epsilon})$   $= \log^{O(1)}(T(N))$    Yes
SNC         $> \log^{O(1)}(T(N))$    $= T(N)^{O(1)}$          Unknown
SP          $= O(T(N)^{\epsilon})$   $= T(N)^{O(1)}$          Yes

Table 2.2.2: Closure under the restartable fail-stop transformation $\rho$.

2.2.8  Discussion:  on  randomization  and  approximation 

We have presented an overview of the theory of efficient and fault-tolerant
parallel algorithms. Our focus has been deterministic algorithms, partly because
our work has concentrated on this topic, but also because many deterministic
techniques exist for the problems of interest.

We close our exposition with an observation (by D. Michailidis) that
illustrates the power of randomization (vs determinism). As we described above,
deterministic Write-All solutions require logarithmic time. This is true even
for approximate Write-All. However:

Theorem 2.2.21 The approximate Write-All problem (AWA) of size N, where
the number of locations to be written is $N' = \alpha N$ and the number of surviving
processors is at least $\beta N$, for some constants $0 < \alpha, \beta < 1$, can be solved
probabilistically (error is Monte Carlo) on a CRCW PRAM with $O(N)$ expected
work in $O(1)$ parallel steps.
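The flavor of this result can be seen in a small sequential simulation. The scheme below is an assumption made for illustration and is not necessarily the construction behind the theorem: in each round every surviving processor writes a 1 to a uniformly random location, and the number of rounds depends only on $\alpha$ and $\beta$. The function name `approximate_write_all`, the extra slack round, and the fixed seed are likewise hypothetical choices.

```python
import math
import random


def approximate_write_all(n: int, alpha: float, beta: float, seed: int = 0):
    """Simulate random writes by ceil(beta * n) surviving processors.

    Returns (cells written, parallel rounds used, total writes performed).
    """
    rng = random.Random(seed)
    survivors = math.ceil(beta * n)
    # After r rounds the expected unwritten fraction is about exp(-beta * r).
    # Pick r so that this falls below 1 - alpha, plus one round of slack.
    rounds = math.ceil(math.log(1.0 / (1.0 - alpha)) / beta) + 1
    cells = [0] * n
    work = 0
    for _ in range(rounds):
        for _ in range(survivors):
            # Several processors may hit the same cell in one round; they all
            # write the same value, which is what a CRCW PRAM permits.
            cells[rng.randrange(n)] = 1
            work += 1
    return sum(cells), rounds, work


written, rounds, work = approximate_write_all(10_000, alpha=0.5, beta=0.5)
print(written, rounds, work)
```

In this sketch the number of rounds is a constant independent of N, the total number of writes is rounds times the number of survivors and hence linear in N, and the outcome can fall short of $\alpha N$ written cells only with small probability, which is the Monte Carlo character of the error.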

Randomization is an important algorithmic tool which has had extensive
and fruitful application to fault-tolerance, e.g., [36]. Probabilistic techniques
have played a key role in the analysis of asynchronous parallel computing; see,
for example, [4, 5, 9, 10, 15, 22, 23, 21, 30, 32, 34]. Note, however, that it is
often hard to compare the analytical bounds of deterministic vs randomized
algorithms, since much of the randomized analysis is done using an oblivious
adversary assumption.


Randomized algorithms often achieve better practical performance than
deterministic ones, even when their analytical bounds are similar. Future
developments in asynchronous parallel computation will employ randomization as
well as the array of deterministic techniques surveyed here.
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